Red Wine Data Analysis by Raahul Seshadri

I don’t have much experience with wine. However, I did notice that we have a feature called “quality”, and my analysis will revolve around finding out correlations of other variables to quality.

Univariate Plots Section

We’ll go through the summary of the dataset and a few univariate plots to understand the structure of the data.

Summary of the dataset

Before doing a deep dive into the data, I pull up a summary of every single variable. This gives me, at a glance, the distribution, mean, median, min, and max values for all the variables. This will come in handy when I create univariate plots around these variables, because it’ll allow me to select parameters like the binsize.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
nrow(winedata)
## [1] 1599
ncol(winedata)
## [1] 13

We observe the following:

  1. There are 1599 observations across 13 features.
  2. As mentioned on the website, there are no missing values for any of the features.
  3. All the fields are numeric in nature.
  4. Quality is the only field that’s human-defined. Rest of the fields are chemical properties of the wine. It’d be interesting to find the relationship between these numeric variables and the quality.

Distribution of quality

Observations:

  1. Quality ranges from 3 to 8.
  2. No wine in the dataset has been rated with a quality of 1, 2, 9, and 10.
  3. Most of the wine have a quality rating of 5 or 6, which is much more in comparison to the good/bad wines.

Distribution of a few numeric features

I’ll first look at a few numeric features and their distributions. There are no categorical features. I will not be getting rid of any outliers in my univariate analysis, because I can’t be sure yet if they are errors, or whether the data was collected in such a way that we see outliers. For example, the distribution of quality is disproportionate as we saw above.

The bin-width has to set by looking at the min/max value. I’ll keep referring to the summary statistics to choose the right no. of bins or the binsize, depending on which is convenient.

Fixed acidity

Observations:

  1. The data is positively skwewed with mean at 8.32 (as per the statistics we calculated earlier).
  2. The median is 7.90.

Volatile acidity

Observations:

  1. Looks positively skewed with a mean of 0.5278, and median of 0.5200.

Citric acid

Observations:

  1. Definitely not normal. Although I tried the binwidth of 0.05 (for higher resolution), I settled on 0.1 so that I can see the trend clearly.
  2. Towards the end, I can see some missing values (somewhere between 0.75 and 1.0).

pH

The value of pH ranges from 0 to 14 (15 values). I’ve chosen 30 as the bin size so that I get a 0.5 resolution on the pH.

Observations:

  1. The distribution is again normal with mean 3.311 and median 3.310 (indicating no skew).

Residual sugar

Observations:

  1. Positively skewed normal with a lot of outliers on the positive end.
  2. By taking log10 on the x axis (2nd plot), we get to see how the outliers are distributed.

Density

I iterated on the number of bins from 10 to 20 to 40, till I was satisfied with the visual resolution.

Observations:

  1. Normally distributed around mean 0.9967.

Chlorides

Observations:

  1. A positively skewed normal distribution with a lot of outliers.
  2. By taking log10 on the x axis in the 2nd plot, we see the distribution of outliers much better. We clearly see a long tail to the right.
  3. Also, the 2nd plot makes apparent values to the very left, which is separate from the main distribution by a big break.

Alcohol

Observations:

  1. Positively skewed distribution.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations across 13 features. As mentioned on the website, there are no missing values for any of the features.

All the fields are numeric in nature. Quality is the only field that’s human-defined. Rest of the fields are chemical properties of the wine.

Apart from citric acid, all the other plots were normally distributed, or normally distributed with a positive skew. I saw an unusually large tail of outliers in the case of residual sugar and chlorides.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest was the “quality”, since it was human-determined. There are no categorical variables of interest, so all numeric variables with normal distribution look about the same in a univariate analysis.

To get more information about which numeric features are of importance, we’ll need to do a bivariate analysis on the numeric feature and quality. The correlation between numeric features can also be found to detect redundance.

What other features in the dataset do you think will help support your

The dataset did mention that the quality is the median of at least 3 evaluations made by wine experts. It’d be interesting to know the difference in the scoring of the same wine between evaluations. Any difference above 1 on either side could be statistically significant and can indicate the presence of bias.

Did you create any new variables from existing variables in the dataset?

No, I did not.

Of the features you investigated, were there any unusual distributions?

I did not transform any of the data. Although the Udacity video mentioned that “normalizing” data is required to apply linear regression, it is incorrect. The data can be in any format, the error has to be normally distributed.

The only unusual distribution that I saw was for citric acid, which did not follow a normal pattern.

Bivariate Plots Section

I’ll first plot the correlations between all variables. This will help me inspect the right ones. I’ll only look at values that are >= 0.25 in either direction. Correlations lesser in magnitude than 0.25 can also tell us a lot about the dataset, but I’m confining my investigation to the major features.

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

The ggcorr function allows us to quickly visualize correlations between variables. Then, we can look at the table constructed earlier to narrow down on individual points and their values.

Below, I enumerate all correlations that I’d explore later (>= 0.25).

Observations on quality:

  1. Volatile acidity (-0.390557780)
  2. Sulphates (0.251397079)
  3. Alcohol (0.47616632)

Observations on fixed acidity:

  1. Volatile acidity (-0.25613089)
  2. Density (0.66804729)
  3. pH (-0.68297819)

Observations on volatile acidity:

  1. Fixed acidity (-0.25613089)
  2. Citric acid (-0.55249568)
  3. pH (0.23493729) (this is not more than 0.25, but it’s positive for some reason)
  4. Sulphates (-0.260986685)
  5. Quality (-0.39055778)

Observations on citric acid:

  1. Fixed acidity (0.67170343)
  2. Volatile acidity (-0.552495685)
  3. Density (0.36494718)
  4. pH (-0.54190414)
  5. Sulphates (0.312770044)

Observations on residual sugar:

  1. Density (0.35528337)

Observations on chloride:

  1. pH (-0.26502613)
  2. Sulphates (0.371260481)

Observations on free sulphur dioxide:

  1. Total sulphur dioxide (0.66766645)

Observations on total sulphur dioxide:

  1. Free sulphur dioxide (0.667666450)

Observations on density:

  1. Fixed acidity (0.66804729)
  2. Citric acid (0.36494718)
  3. Residual sugar (0.355283371)
  4. pH (-0.34169933)
  5. Alcohol (-0.49617977)

Observations on pH:

  1. Fixed acidity (-0.68297819)
  2. Citric acid (-0.54190414)
  3. Chlorides (-0.265026131)
  4. Density (-0.34169933)
  5. Volatile acidity (0.23493729) (this is not more than 0.25, but it’s positive for some reason)

Observations on sulphates:

  1. Volatile acidity (-0.260986685)
  2. Citric acid (0.31277004)
  3. Chlorides (0.371260481)
  4. Quality (0.25139708)

Observations on alcohol:

  1. Density (-0.49617977)
  2. Quality (0.47616632)

Analysis of features with strong correlation with quality

Let’s take a look at features that have a correlation with quality, namely: Volatile acidity, sulphates, alcohol.

Alcohol and quality

It’s difficult to decipher the above scatterplot. While we know that there’s a correlation, a boxplot might give us more insights.

Observations:

  1. The box plot is much clearer in showing that higher quality wines have a higher alcohol content.
  2. Quality level 5 has the highest number of outliers. It’s hard to say why this is the case as of now.

Sulphates and quality

Observations:

  1. While the relationship is not as strong as alcohol vs. quality, sulphates do show positive correlation to quality.
  2. Thus, higher quality alcohols tend to have higher medians of sulphates.
  3. Quality 5 again has the highest number of outliers. It could be that average alcohols were the most common choice in the dataset by the human reviewers, and these people might have decided to err on the side of average.

Volatile acidity and quality

Observations:

  1. A clear negative trend. Higher quality alcohols have lower amount of volatile acidity.
  2. Quality score 5 has a quantity of outliers comparabable to Quality 6, but more spaced out. When it comes to quality score of 5, the variables we’ve analysed so far say very little about the quality in isolation. A combination of these features could tell us more about why a certain alcohol got a quality rating 5.

Relationship between various acidity parameters

I see three features on acidity, “fixed.acidity”, “volatile.acidity”, “citric.acidity”. We also know their correlation with each other. Let’s visualize it.

Observations:

  1. These observations corroborate the correlation coefficients that we calculated earlier.
  2. Fixed and volatile acidity don’t have a strong correlation (pearson coeff of magnitude 0.25)
  3. However, fixed and citric show signs of positive correlation (pearson coeff of 0.67)
  4. Volatile and citric acid show signs of negative correlation (pearson coeff of -0.55)

pH and density

Let’s look at a boxplot for a better view of the correlation.

In order to make the boxplot feasible, I multiply the pH by 4 and find the “ceiling.” This gives me an integer, which I then divide by 4 again. In essence, pH gets grouped in sections of 0.25, and allows me to plot a boxplot by treating the pH groups as a factor.

Observations:

  1. With increasing pH, we observe decreasing density.

Volatile acidity vs. pH

For some reason, volatile acidity has a positive correlation with pH.

For quality score 6, the anomalous relationship between pH and acidity is even more apparent.

Observations:

  1. Since lower pH should generally mean higher acidity, this relationship is anomalous.
  2. However, we’re talking about volatile acidity here. As per “http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity”, there is a specific method to measuring volatile acidity. Thus, I’m assuming that you cannot detect its presence solely by using a “litmus” test.
  3. Thus, the relationship between pH and volatile acidity might not really exist, and could be driven a third variable instead (pH might be measuring only fixed acidity). For example, fixed acidity shows negative correlation with pH, and fixed acidity also shows negative correlation with volatile acidity. By extension, pH is bound to show positive correlation with volatile.acidity. But it could be fixed acidity driving the relation, whereas there is no causation between volatile.acidity and pH.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. There were strong correlations of alcohol, volatile acidity, and sulphates with quality.
  2. There seemed to be an unusual positive correlation between volatile acidity and pH, but it turned out to be a red herring. It was fixed acidity’s correlation with pH and volatile acidity, whereas volatile acidity is measured using specialized methods.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

  1. The most interesting one was between pH and volatile acidity as I explained above.

What was the strongest relationship you found?

  1. The relationship between fixed.acidity and pH was the strongest, with a pearson coefficient of 0.68. However, this relationship is not a surprise and rather scientific.
  2. However, the strongest pertinent one was between alcohol and quality, with a pearson coefficient of 0.47.

Multivariate Plots Section

Since quality is the only categorical variable, all the multi-variate plots will have to be faceted around it. We’ll explore some of the interesting relationships from the bivariate section but faceted on quality

Explore relationship between acidity for different qualities

Observations:

  1. For different qualities, the geom_smooth line looks similarly sloped. So, quality does not seem to have an impact in how these quantities vary.
  2. Fixed acidity and volatile acidity have negative correlation.
  3. Citrict acid and fixed acidity have positive correlation.

Relationship between fixed acidity and volatile acidity for different pH

Let’s explore the seemingly anomalous relationship in detail. In the bivariate section, I had hypothesized that fixed acidity is the lurking variable in the relationship between pH and volatile acidity.

Observations:

  1. For different groupings of pH, fixed and volatile acidity either show positive or negative correlation.
  2. This explains the seeming anomaly between volatile acidity and pH. Fixed acidity and pH have a strong negative correlation as expected. Hoewver, volatile acidity’s correlation with fixed acidity is different depending on the pH level. As I corroborated earlier, volatile acidity might not have a direct correlation to pH due to how it’s measured.

Examine features having strong correlation to quality

Volatile acidity, sulphates, and alcohol had strong correlation to quality. We’ll examine the relationships between these three variables faceted by quality.

Observations:

  1. We can observe that less volatile acidity correlates with better wine quality.
  2. Also, increase in concentration of alcohol correlates with better wine quality.
  3. This corroborates what we found earlier in the bivariate analysis.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  1. I observed that quality correlated positively with alcohol, and negatively with volatile acidity.
  2. Fixed acidity has positive correlation with citric acid across all quality levels.
  3. Fixed acidity has negative correlation with vilatile acid across all quality levels.

Were there any interesting or surprising interactions between features?

  1. By looking at fixed acidity and volatile acidity across different pH, we could see that the relationship is sometimes positively correlated and sometimes negatively correlated depending on the pH. We could finally find out why we got a positive correlation between pH and volatile acidity, which corroborates what we found in bivariate analysis.

Final Plots and Summary

I’m choosing the 3 plots that represent a strong and pertinent relationship.

Alcohol vs. quality

This is a very important plot, because it shows the relationship between the quality of the wine and the quantity of alcohol that it contains. It shows that more quantity of alcohol is correlated with better quality wine.

If we were designing a model, I’d put alcohol, volatile acidity, and sulphates in as the main features and check what accuracy I get, before pulling in other features.

Fixed acidity vs. volatile acidity faceted by pH

Earlier, I saw an interesting relationship between volatile acidity and pH. Generally, lower pH suggests higher acidity. However, pH and volatile acidity had a positive correlation, which didn’t make sense at first.

Also, fixed acidity has a strong negative correlation with pH, as it was expected. My theory was that the observed relationship between volatile acidity and pH was driven by some third variable (fixed acidity), and they weren’t really directly correlated.

This is one of the plots that shows why the data is the way it is. For different values of pH, we see both positive and negative correlation between volatile and fixed acidity. It means that the amount of volatile acidity really doesn’t depend on fixed acidity, or else we would have seen correlation in only one direction for all values of pH. This means that pH is not really measuring the volatile acidity, but the dataset’s construction leads to this red herring correlation between pH and volatile acidity.

Fixed vs. citric acidity

I selected this plot because when we go ahead and train a regression model to predict quality, we’d want all of our variables to be as uncorrelated as possible. Or else, it indicates a certain degree of redundancy. Increase in dimensionality of features would require more amount of data to maintain accuracy.

In cases of fixed and citric acid, I see an almost linear relationship. In such cases, instead of having both of them as features, I’d just have one feature that’s the ratio of the two, or just pick one of the two. This way, the regressor has to deal with one less feature.

As we saw above, the same is not true for fixed and volatile acidity, so I’d leave them separate and untouched.

Plots like this allows us to make important decisions on reducing dimensionality, checking if PCA would help, etc. Basically, with lesser dimensions and redundancy, our model will be able to learn better with the same amount of data.


Reflection

My Approach

When I started with the exploration, the first thing that I noticed was that all the features were numeric. The target variable, Quality, was the only categorical variable.

I had to adjust my mindset to this dataset, because having categorical features helps with plotting interesting multivariate plots. Because the features were numerical, I had to come up with artificial factors by grouping continuous variables into discrete intervals. I did this in some areas where my domain knowledge helped me get the right interval, for example, the pH. Quality was the other variable that I used to do multivariate plots.

That target that I set for myself was to find out what influences the quality. When I read the description of quality from the website that I obtained the dataset from, it said that the quality was determined from evaluation by 3 different wine experts. However, it would have been immensely informative for me to have those 3 scores separately, rather than just the mean or media.

In the univariate section, I was more interested in getting to know the structure of the data. I started with quality, so that I knew the “class distribution.” As I expected, it turned out to resemble a normal distribution. Knowing this was critical, because it influenced how I interpreted the other univariate distributions.

For example, if I ever saw a normal distribution for another feature, I took it as a candidate that could predict the quality, because its distribution matched that of the quality, and could thus have differentiating information encoded in it. Of course, correlation was used to confirm to what degree a variable affected quality, but I’m just focusing on the visual aspects.

Univariate distributions that would have been uniform or partly uniform, might not have helped much predicting the wine quality, except for a subset of target classes (they’d only have helped with identifying, say, higher quality wine, whereas being uniform for other wines).

The bivariate analysis was an interesting opportunity to learn about interrelationships between variables. Since there were a lot of combinations of variables, I started out by calculating correlations between all of them, so that I could narrow down on a few combinations that I’d end up visualizing.

This is where I investigated strong relationships of features with the quality. It was also a good way of finding redundancies between features (high correlation). When modelling, we deal with the curse of dimensionality, and bivariate plots help us confirm if we can reduce the dimensionality of features by either eliminating the redundant ones, or pulling in linear combinations of variables instead of having them separately.

I also started investigating the relationship between volatile acidity and pH. Although my initial assumption was that they should be negatively correlated, but I was wrong in my assumption of how the pH was being measured, and how volatile acidity is measured. I drew a couple of plots around it.

The multivariate section was more of a way to reaffirm observations from the bivariate section. I only faceted on quality (which is the only categorical variable) and pH (because I had the domain knowledge to know that grouping intervals of pH could lead to interesting plots).

I selected the final plots to bring out correlation with quality, which was our target variable. But instead of just fixating on quality, I also picked out plots that showed strong correlation, hence redundancy, in the data. Since the volatile acidity vs. pH was initially confusing due to my assumption, I included that plot as well to demonstrate analysis that required several graphs to understand.

How could the analysis be improved in future work

I found that quality 5 and 6 were the most common wine entries in the dataset. In comparison, wine quality 3 & 4, and 7 & 8, has fewer records. Just like we group age into intervals of 10 years or 20 years for better analysis, we can group these levels into 3 quality levels: low, average, high. It would be interesting to see quality vs. other parameters under this new “grouped” quality levels.

Adding to it, I would perform cluster analysis on all the wines, with the no. of clusters as 3 (since we have 3 levels of quality: low, average, high). Of course, it’s not guaranteed that the dataset would get clustered as per their quality. If they seem noisy, I’d increase the number of clusters to 6 or 9, to get a sense of what linear combinations affect the quality.

Within each cluster, I’d then plot the distribution of quality. Depending on what I see, I would have continued more exploration.

Apart from that, I could have trained a overfitting decision tree or had done PCA to find out which features were more important to the determination of quality, and tallied it with what I thought were more important through the analysis that I’ve done above.